How to kill inventors: testing the Massacrator© algorithm for inventor disambiguation

Scientometrics

Abstract

Inventor disambiguation is an increasingly important issue for users of patent data. We propose and test a number of refinements to the Massacrator algorithm, originally proposed by Lissoni et al. (The Keins database on academic inventors: methodology and contents, 2006) and now applied to APE-INV, a free-access database funded by the European Science Foundation. Following Raffo and Lhuillery (Res Policy 38:1617–1627, 2009), we describe disambiguation as a three-step process: cleaning & parsing, matching, and filtering. By means of sensitivity analysis based on Monte Carlo simulations, we show how various filtering criteria can be manipulated in order to obtain optimal combinations of precision and recall (type I and type II errors). We also show how these different combinations generate different results for applications to studies on inventors’ productivity, mobility, and networking, and we discuss data quality problems related to linguistic issues. The filtering criteria based upon information on inventors’ addresses are sensitive to data quality, while those based upon information on co-inventorship networks are always effective. Details on data access and data quality improvement via feedback collection are also discussed.
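
As a rough illustration of the precision/recall trade-off mentioned above, the following Python sketch scores a set of matched inventor pairs against a hand-checked benchmark. It assumes the standard definitions (precision = true positives over all predicted matches, recall = true positives over all benchmark matches); the function name, the inventor identifiers, and the two example filter settings are hypothetical and are not taken from Massacrator itself.

```python
def pair(a, b):
    """Order-independent key for a pair of inventor identifiers."""
    return frozenset((a, b))


def precision_recall(predicted, benchmark):
    """Return (precision, recall) of predicted matches against benchmark matches.

    Both arguments are sets of order-independent inventor pairs.
    """
    true_pos = len(predicted & benchmark)    # matches confirmed by the benchmark
    false_pos = len(predicted - benchmark)   # matches the benchmark rejects
    false_neg = len(benchmark - predicted)   # benchmark matches that were missed
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if benchmark else 0.0
    return precision, recall


# Hypothetical benchmark: three pairs known to be the same person.
benchmark = {pair("i1", "i2"), pair("i2", "i3"), pair("i4", "i5")}

# A strict filter keeps few matches; a loose filter keeps many.
strict = {pair("i1", "i2")}
loose = {pair("i1", "i2"), pair("i2", "i3"), pair("i4", "i5"), pair("i1", "i6")}

print("strict:", precision_recall(strict, benchmark))
print("loose: ", precision_recall(loose, benchmark))
```

In this toy case the strict setting reaches a precision of 1.0 with a recall of one third, while the loose setting reaches a recall of 1.0 with a precision of 0.75, which is the kind of trade-off the filtering criteria are tuned against.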

Notes

  1. Access information for PatStat at: http://forums.epo.org/epo-worldwide-patent-statistical-database/ - last visited: 6/27/2014.

  2. For the definition of “precision” and “recall”, see section Cleaning & Parsing.

  3. See the post “Converting patstat text fields into plain ascii” on the RawPatentData blog (http://rawpatentdata.blogspot.com/2010/05/converting-patstat-text-fields-into.html ; last access: March, 2014).

  4. As an example, consider token "ABCABC" as t1 and token "ABCD" as t2. The bigram sets for t1 and t2 will be, respectively, (AB, BC, CA, AB, BC) and (AB, BC, CD). Applying Equation 1 returns:

    $$ 2G(t1, t2) = \frac{\sqrt{(2 - 1)_{AB}^{2} + (2 - 1)_{BC}^{2} + (1 - 0)_{CA}^{2} + (0 - 1)_{CD}^{2}}}{5 + 3} = \frac{2}{8} = 0.25 $$

  5. For a definition of patent family, see Martinez (2011).

  6. Huang et al.'s original formula was proposed to compare inventors with no more than one patent each. We have adapted it to the case of inventors with multiple patents.

  7. More precisely, NAFA and NAE contain matches between an inventor and one of his/her patents, and another inventor and one of his/her patents, plus information on whether the two inventors are the same person, according to information collected manually. Having been hand-checked, the matches in the benchmark databases are expected to contain neither false positives nor false negatives. Notice that both NAFA and NAE are based upon the PatStat October 2009 release. A detailed description is available online (Lissoni et al. 2010).

  8. The NAFA and NAE frontiers include not only the most extreme points, but are extended to include all outcomes with precision and recall values higher than \( \text{Precision}(\bar{o}) - 0.02 \) and \( \text{Recall}(\bar{o}) - 0.02 \) for any \( \bar{o} \). This will prove useful for the ensuing statistical exercise.

  9. Remember that W ω k is a random variable with expected value equal to 0.5. By definition, any sample with a different mean cannot be randomly drawn, and must be considered either over- or under-represented by comparison to a random distribution. If the estimated impact of a criterion is not significantly different from zero for recall, but positive for precision, then it is desirable to include it in any parametrization, as it increases precision at no cost in terms of recall. Conversely, any filter with zero impact on precision, but a significantly negative impact on recall, ought to be excluded from any parametrization, as it bears a cost in terms of the latter and yields no gain in terms of precision. We have conducted this type of analysis and found it helpful for understanding the relative importance of the different filtering criteria. We do not report it for reasons of space, but it is available on request.

  10. Regression analysis can be applied to the same set of results in order to estimate the marginal impact of each filtering criterion and of the Threshold on either precision or recall, other things being equal. In general, we expect all filters to bear a negative influence on recall (in that they increase the number of negative matches, both true and false) and a positive influence on precision (they eliminate false positives).

  11. The figures presented here are the result of further adjustments we introduced in order to solve transitivity problems. Transitivity problems may emerge for any triplet of inventors (such as I, J, and Z) whenever two distinct pairs are recognized to be the same person (e.g. I & J and J & Z), but the same does not apply to the remaining pair (I & Z are not matched, or are considered negative matches). In this case we need to decide whether to revise the status of I & Z (and consider the two inventors as the same person as J) or the status of the other pairs (and consider either I or Z as a different person from J). When confronting this problem, we always opted for considering the two inventors the same person, so that I, J and Z are the same individual according to Massacrator.

  12. The fields of chemistry and pharmaceuticals are defined as in Schmoch (2008). We consider only these fields, and the years from 2000 to 2005, for ease of computation. Co-inventorship is defined as a connection between two inventors having (at least) one patent in common.

  13. On immigration of inventors, see Miguelez and Fink (2013) and Breschi et al. (2014). Both papers provide information on ongoing attempts to classify inventors according to their nationality and/or country of origin (country of birth, or of parents’ or grandparents’ birth). In the near future, it will be possible to use such information to refine new versions of Massacrator (see Conclusions).
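
The worked example in note 4 can be reproduced with a short Python sketch. It assumes that Equation 1 (not reproduced in these notes) has the form implied by that example: the square root of the summed squared differences in bigram counts, divided by the total number of bigrams in the two tokens. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def bigrams(token):
    """Return the list of overlapping two-character substrings of a token."""
    return [token[i:i + 2] for i in range(len(token) - 1)]


def two_gram_distance(t1, t2):
    """2-gram distance between two tokens, as implied by the example in note 4."""
    counts1 = Counter(bigrams(t1))
    counts2 = Counter(bigrams(t2))
    total = sum(counts1.values()) + sum(counts2.values())
    if total == 0:
        # Tokens shorter than two characters have no bigrams to compare.
        return 0.0
    # A Counter returns 0 for a bigram that is absent from a token.
    squared_diff = sum(
        (counts1[g] - counts2[g]) ** 2 for g in counts1.keys() | counts2.keys()
    )
    return sqrt(squared_diff) / total


print(bigrams("ABCABC"))                    # bigrams of t1
print(bigrams("ABCD"))                      # bigrams of t2
print(two_gram_distance("ABCABC", "ABCD"))  # distance for the note 4 example
```

For the tokens of note 4 the four bigram count differences each contribute 1 under the square root, and the denominator is 5 + 3, so the sketch returns 0.25. Identical tokens give a distance of 0.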

References

  • Agrawal, A., Cockburn, I., & McHale, J. (2006). Gone but not forgotten: knowledge flows, labor mobility, and enduring social relationships. Journal of Economic Geography, 6(5), 571.

  • Azoulay, P., Ding, W., & Stuart, T. (2009). The impact of academic patenting on the rate, quality and direction of (public) research output. The Journal of Industrial Economics, 57, 637–676.

  • Balconi, M., Breschi, S., & Lissoni, F. (2004). Networks of inventors and the role of academia: an exploration of Italian patent data. Research Policy, 33(1), 127–145.

  • Barrai, I., Rodriguez-Larralde, A., Mamolini, E., & Scapoli, C. (1999). Isonymy and isolation by distance in Italy. Human Biology, 71, 947–961.

  • Bilenko, M., Kamath, B., & Mooney, R. J. (2006). Adaptive blocking: Learning to scale up record linkage. In Sixth International Conference on Data Mining (ICDM’06). IEEE, pp. 87–96.

  • Borgatti, S. P., Mehra, A., Brass, D. J., & Labianca, G. (2009). Network analysis in the social sciences. Science, 323(5916), 892–895.

  • Breschi, S., & Lissoni, F. (2005). Knowledge networks from patent data. In H. F. Moed, W. Glänzel & U. Schmoch (Eds.), Handbook of Quantitative Science and Technology Research. Amsterdam: Springer.

  • Breschi, S., & Lissoni, F. (2009). Mobility of skilled workers and co-invention networks: an anatomy of localized knowledge flows. Journal of Economic Geography.

  • Breschi, S., Lissoni, F., & Montobbio, F. (2008). University patenting and scientific productivity: a quantitative study of Italian academic inventors. European Management Review, 5(2), 91–109.

  • Breschi, S., Lissoni, F., & Tarasconi, G. (2014). Inventor Data for Research on Migration & Innovation: a Survey and a Pilot. WIPO Economic Research Working Paper. N.17, World Intellectual Property Organization, Geneva.

  • Burt, R. S. (1987). Social contagion and innovation: cohesion versus structural equivalence. American Journal of Sociology, 1287–1335.

  • Carayol, N., & Cassi, L. (2009). Who’s Who in Patents. A Bayesian approach. Cahiers du GREThA, 7, 07–2009.

  • Den Besten, M., Lissoni, F., Maurino, A., Pezzoni, M., & Tarasconi, G. (2012). APE-INV data dissemination and users’ feedback project. Mimeo (http://www.academicpatentig.eu).

  • Fleming, L., King, C., & Juda, A. I. (2007). Small Worlds and Regional Innovation. Organization Science, 18, 938–954.

  • Freeman, L. C. (1979). Centrality in social networks conceptual clarification. Social Networks, 1(3), 215–239.

  • Griliches, Z. (1990). Patent statistics as economic indicators: A survey. Journal of Economic Literature, 28(4), 1661–1707.

  • Huang, H., & Walsh, J. P. (2011). A new name-matching approach for searching patent inventors. mimeo.

  • Li, G.C., Lai, R., D’Amour, A., Doolin, D.M., Sun, Y., Torvik, V.I., Yu, A.Z., & Fleming, L. (2014). Disambiguation and co-authorship networks of the US patent inventor database. Research Policy, 43(6), 941–955.

  • Lissoni, F., Coffano, M., Maurino, A., Pezzoni, M., & Tarasconi, G. (2010). APE-INV’s Name Game Algorithm Challenge: A Guideline for Benchmark Data Analysis & Reporting. mimeo.

  • Lissoni, F., Llerena, P., McKelvey, M., & Sanditov, B. (2008). Academic patenting in Europe: new evidence from the KEINS database. Research Evaluation, 17(2), 87–102.

  • Lissoni, F., Pezzoni, M., Poti, B., & Romagnosi, S. (2013). University Autonomy, the Professor Privilege and Academic Patenting: Italy, 1996–1997. Industry and Innovation, 20(5), 399–421.

  • Lissoni, F., Sanditov, B., & Tarasconi, G. (2006). The Keins database on academic inventors: methodology and contents. WP CESPRI, 181.

  • Marx, M., Strumsky, D., & Fleming, L. (2009). Mobility, skills, and the Michigan non-compete experiment. Management Science, 55(6), 875–889.

  • Maurino, A., & Li, P. (2012). Deduplication of large personal database. Mimeo.

  • Miguelez, E., & Fink, C. (2013). Measuring the International Mobility of Inventors: A New Database, WIPO Economic Research Working Paper N.8, World Intellectual Property Organization, Geneva.

  • Nagaoka, S., Motohashi, K., & Goto, A. (2010). Patent statistics as an innovation indicator. Handbook of the Economics of Innovation, 2, 1083–1127.

  • On, B.-W., Lee, D., Kang, J., & Mitra, P. (2005). Comparative study of name disambiguation problem using a scalable blocking-based framework. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, pp. 344–353.

  • Raffo, J., & Lhuillery, S. (2009). How to play the name game: patent retrieval comparing different heuristics. Research Policy, 38(10), 1617–1627.

  • Schmoch, U. (2008). Concept of a technology classification for country comparisons. Final report to the World Intellectual Property Organization (WIPO), Fraunhofer Institute for Systems and Innovation Research, Karlsruhe.

  • Smalheiser, N. R., & Torvik, V. I. (2009). Author name disambiguation. Annual Review of Information Science and Technology, 4(1), 31–43.

  • Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. doi:10.1145/1552303.1552304.

  • Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. doi:10.1002/asi.20105.

  • Yasuda, N. (1983). Studies of isonymy and inbreeding in Japan. Human Biology, 263–276.

Acknowledgements

This paper derives from research undertaken with the support of APE-INV, the Research Networking Programme on Academic Patenting in Europe, funded by the European Science Foundation. Early drafts benefitted from comments by participants in the APE-INV NameGame workshop series. We are also grateful to Nicolas Carayol, Lorenzo Cassi, Stephan Lhuillery and Julio Raffo for providing us with core data for the two benchmark datasets. Monica Coffano and Ernest Miguelez provided extremely valuable research assistance. Andrea Maurino’s expertise on data quality has been extremely helpful.

Author information

Corresponding author

Correspondence to Michele Pezzoni.

About this article

Cite this article

Pezzoni, M., Lissoni, F. & Tarasconi, G. How to kill inventors: testing the Massacrator© algorithm for inventor disambiguation. Scientometrics 101, 477–504 (2014). https://doi.org/10.1007/s11192-014-1375-7

  • DOI: https://doi.org/10.1007/s11192-014-1375-7
